Overlapping, Rare Examples and Class Decomposition in Learning Classifiers from Imbalanced Data

Author

  • Jerzy Stefanowski
Abstract

This paper deals with inducing classifiers from imbalanced data, where one class (the minority class) is under-represented in comparison to the remaining classes (the majority classes). The minority class is usually of primary interest, and its members must be recognized as accurately as possible. Class imbalance is a difficulty for most algorithms that learn classifiers, because they are biased toward the majority classes. The first part of this study discusses the main properties of data that cause this difficulty. Following a review of earlier, related research, several types of artificial imbalanced data sets affected by critical factors were generated, and decision trees and rule-based classifiers were induced from them. Results of the first experiments show that too small a number of examples from the minority class is not the main source of difficulties. These results confirm the initial hypothesis that the degradation of classification performance is more related to the decomposition of the minority class into small sub-parts. Another critical factor is the presence of a relatively large number of borderline examples from the minority class in the overlapping region between classes, in particular for non-linear decision boundaries. A novel observation is the impact of rare examples from the minority class located inside the majority class. The experiments show that stepwise increasing the number of borderline and rare examples in the minority class has a larger influence on the considered classifiers than increasing the decomposition of this class. The second part of the paper studies the improvement of classifiers by pre-processing such data with re-sampling methods. Further experiments examine the influence of the identified critical data factors on the performance of four pre-processing re-sampling methods: two versions of random over-sampling, the focused under-sampling method NCR, and the hybrid method SPIDER. Results show that if the data is sufficiently disturbed by borderline and rare examples, SPIDER and, in part, NCR work better than over-sampling.

Jerzy Stefanowski, Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 2, 60–965 Poznań, Poland
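
As a rough illustration of the simplest re-sampling method named in the abstract, the following Python sketch performs plain random over-sampling: minority examples are duplicated at random until the minority class is as large as the largest other class. This is not necessarily either of the two over-sampling versions examined in the paper; the function name `random_oversample`, the balancing target, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(examples, labels, minority_label, seed=0):
    """Hypothetical helper: duplicate randomly chosen minority examples
    until the minority class matches the size of the largest other class.

    Assumes at least one minority example and at least one other class.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # Balancing target: size of the largest non-minority class (an assumption).
    target = max(n for lab, n in counts.items() if lab != minority_label)
    minority = [x for x, y in zip(examples, labels) if y == minority_label]
    n_extra = max(0, target - len(minority))
    # Sample with replacement from the existing minority examples.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(examples) + extra, list(labels) + [minority_label] * n_extra


# Toy data: two minority and four majority examples.
X = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.5), (0.7, 0.1), (0.3, 0.9), (0.6, 0.6)]
y = ["min", "maj", "maj", "maj", "maj", "min"]

X_res, y_res = random_oversample(X, y, minority_label="min")
print(Counter(y_res))
```

After the call, both classes in the toy data contain four examples. Because the added examples are exact copies, this kind of over-sampling does not change where the minority examples lie, which may be one reason the abstract reports that focused methods cope better with borderline and rare examples.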

Related papers

Impact of Local Data Characteristics on Learning Rules from Imbalanced Data

In this paper we discuss improving rule-based classifiers learned from class-imbalanced data. Standard learning methods often do not work properly with imbalanced data as they are biased to focus on the majority classes while "disregarding" examples from the minority class. The class imbalance affects various types of classifiers, including the rule-based ones. These difficulties include two g...

Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data

Learning classifiers from imbalanced data is particularly challenging when class imbalance is accompanied by local data difficulty factors, such as outliers, rare cases, class overlapping, or minority class decomposition. Although these issues have been highlighted in previous research, there have been no proposals of algorithms that simultaneously detect all the aforementioned difficulties in ...

Learning from Imbalanced Data in Presence of Noisy and Borderline Examples

In this paper we studied re-sampling methods for learning classifiers from imbalanced data. We carried out a series of experiments on artificial data sets to explore the impact of noisy and borderline examples from the minority class on the classifier performance. Results showed that if data was sufficiently disturbed by these factors, then the focused re-sampling methods – NCR and our SPIDER2 ...

Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...

An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics

Training classifiers with datasets that suffer from imbalanced class distributions is an important problem in data mining. This issue occurs when the number of examples representing the class of interest is much lower than that of the other classes. Its presence in many real-world applications has brought growing attention from researchers. We briefly review the many issues in mach...

Publication date: 2011